CPSC 545/445 (Autumn 2003) - Class 9: Phylogenetic Tree Construction Module 3, Part 2 --- recap: Parsimony: find a tree relating given sequences such that number of substitutions (=mutations) is minimised Small Parsimony Problem: Given tree T with just the leaves labeled, find the labels for the internal nodes such that score S(T) is minimised. S(T) = number of substitutions (from parents to children) Large Parsimony Problem: Given n sequences, find the tree T (completely labeled) with the lowest score S(T). Note: Theta(n!) many different trees with n nodes, problem is NP-hard --- Approaches for solving the large parsimony problem (Note: enumaration is infeasible because number of possible trees grows to quickly - see above) Note: Consider unrooted trees (slightly smaller search space) -- 1) Greedy Constructive Search Idea: iteratively build tree by adding edges one at a time Algorithm (greedy construction): 1. Randomly choose 3 sequences and place on unrooted tree T (Note: for given sequences there is only one such tree) 2. Repeat Randomly select new sequence x and add to T resulting in larger tree T' such that S(T') is minimal over all T' that can be obtained by adding x to T (Note: the new edge to x can be branched off any of the existing edges, dividing it into two) Until tree is complete. Note: This algorithm is not guaranteed to construct optimal tree Running multiple times can give different results -> selecting best of these leads to improving solution quality over time [illustration of search step] -- 2) Branch & Bound Search Idea: Iteratively build trees as in method 1), but - use backtracking to systematically enumerate all trees - abandon particular construction whenever score of partial tree exceeds score of best complete tree obtained so far Algorithm (branch & bound): function bb-ptc(T,X) input: partial tree T with leaves labelled by sequences not in X, set of sequences X output: unrooted tree with leaves labelled by the elements of X if T = empty then select three sequences x1,x2,x2 \in X X' := X \ {x1,x2,x3}; T' := unique unrooted tree with leaves x1,x2,x3 return bb_ptc(T',X'); else if X != empty and S(T) <= bound then select sequence x \in X X' := X \ {x}; for n := 1 to number of edges in T do obtain T'_n from T by adding a new edge that connects x to edge n in T s_n := S(T'_n); T' := T'_n with min s_n if X' = empty then bound := min{bound,s_1,...,s_n}; end for end if end bb-ptc bound := \infty T_opt := bb-ptc(empty,X) Note: can substantially improve performance of algorithm by using a better choice for the initial bound (e.g., a value obtained by methods 1 or 3 below) (For another description of the same idea, see Durbin et al., p.178) [illustration of search tree(?)] Problem: Much more efficient then exhaustive enumeration, but still impractical for large number of sequences (# > 20) -- 3) Stochastic Local Search (SLS) Idea: Start with any tree (e.g., constructed as in 1), iteratively reduce cost by swapping branches Various types of branch swapping: - Nearest Neighbour Interchange, NNI select internal edge e in T note that e has two subtrees attached to each of its two incident nodes interchange two of these (two possibilities: AB-CD -> AC-BD, AD-BC) - Subtree Pruning and Regrafting, SPR: cut off a subtree T' from T reconnect the edge that connected T' to rest of T to another edge in T (eliminate node from which T' was cut off) - Tree Bisection and reconnection, TBR delete edge from T reconnect resulting two subtrees T', T'' by new edge between arbitrary edge in T' and arbitrary edge in T'' (eliminate node from which T' was cut off) [see http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html, http://www.hyphy.org/docs/analyses/methods/nni.html] [illustration of search steps] Algorithm (best improvement): 1. Construct a tree T (e.g., using method 1) 2. Repeat N(T) := set of all trees that can be obtained by NNI from T TT := set of T' from N(T)+{T} with minimal S(T') randomly choose T' from TT Until T'=T Note: A variant of this where the trees T' in N(T) are generated in some order and the first with S(T') < S(T) is accepted can give better performance (first improvement method) Problem: Can easily get stuck in local optima Solution: Allow worsening swaps (-> Randomised Iterative Improvement, Simulated Annealing, ...) Can also use other SLS methods, e.g., Evolutionary Algorithms, ... Note: Methods 1) and 3) can be combined - do SLS after every construction step in 1) This is the method underlying many, if not most, phylogenetic tree construction algorithms used in practice for larger numbers of sequences. --- Simultaneous Alignment and Phylogenetic Tree Construction (here: motivation + idea only) Note: Quality of phylogenetic tree depends on quality of underlying multiple sequence alignment Idea: simultaneously construct tree and alignment Basic approach (analogous to small + large parsimony problems): - find optimal alignment given tree - search over trees to find overall optimum How to find optimal alignment given tree: - linear gap score: use a simple variant of a multidimensional dynamic programming algorithm for multiple sequence alignment (see Durbin et al., p. 141): Instead of the standard subst score for individual residues (+ gaps), use a parsimony score for any partial alignment that can be computed by an upward pass through the given tree using a standard parsimony algorithm (e.g., Durbin et al., p.174 or p.175) [Sankoff & Cedergren, 1983] - affine gap score: algorithm by Hein [1989], uses variant of DP method for pairwise sequence alignment with affine gaps that is generalised to deal with multiple progressive alignments (for details, see Durbin et al., p.181ff) Note: Hein's algorithm works for modest size sets of sequence (tens of), but does not guarantee optimal results How to search over trees to find overall optimum: - use same methods as described above for the large parsimony problem. --- Resources: Durbin et al., Ch.7 Additional material: @incollection{ cottamoscato02:phylotrees:ppsn, author = "Cotta, C. and Moscato, P.", title = "Inferring Phylogenetic Trees Using Evolutionary Algorithms", booktitle = "Parallel Problem Solving From Nature VII", editor = "Merelo, J.J. and Adamidis, P. and Beyer, H.-G. and Fern\'andez-Villaca\~nas, J.-L. and Schwefel, H.-P.", series = "Lecture Notes in Computer Science", volume = "2439", pages = "720--729", publisher = "Springer-Verlag", address = "Berlin", year = "2002", url = "citeseer.nj.nec.com/cotta02inferring.html" } @misc{ salter-algorithms, author = "L. A. Salter", title = "Algorithms for Phylogenetic Tree Reconstruction", url = "citeseer.nj.nec.com/salter00algorithms.html" } http://biohpc.learn.in.th/contents2002/Worawit/worawit.pdf - nice slide set, but contains some errors http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html - nice illustration of neighbourhoods for branch swapping